library(car)    # avPlots
library(carData)
library(MASS)
library(ggplot2)
library(plotly)
library(dplyr)
library(GGally) # ggpairs
library(leaps)
library(olsrr)
Load data
bodyF <- read.csv("./bodyfatmen.csv",
header = TRUE)
View(bodyF)
n <- nrow(bodyF)
p <- ncol(bodyF) # k+1
Create full model
bodyF.model <- lm(density ~ ., data = bodyF)
summary(bodyF.model)
Call:
lm(formula = density ~ ., data = bodyF)
Residuals:
Min 1Q Median 3Q Max
-0.0225107 -0.0071735 0.0002816 0.0064878 0.0254670
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.156e+00 5.061e-02 22.846 < 2e-16 ***
age -1.320e-04 7.392e-05 -1.785 0.07550 .
weight 2.378e-04 1.408e-04 1.689 0.09254 .
height -2.594e-05 4.083e-04 -0.064 0.94939
neck 1.072e-03 5.371e-04 1.995 0.04720 *
chest 1.169e-05 2.360e-04 0.050 0.96056
abdomen -2.200e-03 2.072e-04 -10.618 < 2e-16 ***
hip 5.268e-04 3.336e-04 1.579 0.11569
thigh -6.343e-04 3.336e-04 -1.901 0.05849 .
knee -3.418e-05 5.640e-04 -0.061 0.95172
ankle -4.449e-04 5.107e-04 -0.871 0.38459
biceps -4.274e-04 3.942e-04 -1.084 0.27940
forearm -1.040e-03 4.527e-04 -2.298 0.02245 *
wrist 3.651e-03 1.227e-03 2.976 0.00322 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.009781 on 234 degrees of freedom
Multiple R-squared: 0.7451, Adjusted R-squared: 0.731
F-statistic: 52.63 on 13 and 234 DF, p-value: < 2.2e-16
We see that the model shows strong evidence for abdomen and wrist, but not so much for height, chest, hip, knee, ankle and biceps.
# Normal probability plot of residuals
plot(bodyF.model, which=2)
The plot looks reasonable, so we have relatively strong grounds to believe the normality assumption for the residuals. However, we see slightly light tails. Points 39, 203 and 220 could be outliers, since they deviate from the normality hypothesis (203 and 220 more severely).
# Residuals vs. fitted values
plot(bodyF.model, which=c(1,3))
plot(bodyF.model$fitted.values, studres(bodyF.model), xlab="Fitted values", ylab="Studentized residual")
plot(bodyF.model$fitted.values, rstudent(bodyF.model), xlab="Fitted values", ylab="R-Student residual")
This looks very good: we see no apparent pattern, which is consistent with residuals that are uncorrelated with the fitted values. In the standardized residuals vs. fitted values plot, points 39, 203 and 220 have standardized residuals noticeably larger in magnitude than the rest, which could indicate that they are outliers.
These plots also suggest that we may not need to transform the response. Nevertheless, we can look for the transformation suggested by the Box-Cox method to further confirm this.
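The Box-Cox computation itself is not shown in the notebook. A runnable sketch is below; since `bodyfatmen.csv` is not bundled here, a synthetic positive response stands in for `density`, and with the real data the call would simply be `boxcox(bodyF.model, lambda = seq(-6, 2, 0.1))`.

```r
library(MASS)  # boxcox()

## Synthetic stand-in for the body-fat data (illustration only)
set.seed(42)
demo <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
demo$y <- exp(0.1 * demo$x1 - 0.05 * demo$x2 + rnorm(200, sd = 0.05))
demo.fit <- lm(y ~ ., data = demo)

## Profile log-likelihood over a grid of powers
bc <- boxcox(demo.fit, lambda = seq(-6, 6, 0.1), plotit = FALSE)
lambda.hat <- bc$x[which.max(bc$y)]  # power maximizing the likelihood

## Approximate 95% CI: all lambdas whose log-likelihood lies within
## qchisq(0.95, 1)/2 of the maximum; if it contains 1, no transform is needed
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])
```

If the interval `ci` contains 1, the likelihood does not rule out leaving the response untransformed, which is the reasoning applied to the body-fat model below.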
We observe that the power maximizing the profile likelihood is -4. However, the confidence interval is relatively wide and also contains 1.
We can try the transformation y^-4 and see whether it improves the earlier plots.
bodyF.extra <- bodyF
bodyF.extra$density <- bodyF.extra$density^(-4)  # apply the lambda = -4 transformation
bodyF.model.trans <- lm(density ~ ., data = bodyF.extra)
summary(bodyF.model.trans)
Call:
lm(formula = density ~ ., data = bodyF.extra)
Residuals:
Min 1Q Median 3Q Max
-0.073398 -0.019485 -0.001324 0.021791 0.071225
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.393e-01 1.534e-01 3.515 0.000529 ***
age 3.707e-04 2.241e-04 1.654 0.099508 .
weight -6.524e-04 4.268e-04 -1.529 0.127720
height -3.845e-04 1.238e-03 -0.311 0.756427
neck -3.170e-03 1.629e-03 -1.947 0.052776 .
chest -8.605e-05 7.157e-04 -0.120 0.904406
abdomen 6.732e-03 6.281e-04 10.718 < 2e-16 ***
hip -1.367e-03 1.012e-03 -1.351 0.178036
thigh 1.545e-03 1.012e-03 1.527 0.128114
knee -1.501e-04 1.710e-03 -0.088 0.930130
ankle 1.438e-03 1.549e-03 0.929 0.354081
biceps 1.171e-03 1.195e-03 0.979 0.328357
forearm 3.200e-03 1.373e-03 2.331 0.020594 *
wrist -1.153e-02 3.720e-03 -3.099 0.002180 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.02966 on 234 degrees of freedom
Multiple R-squared: 0.7525, Adjusted R-squared: 0.7388
F-statistic: 54.74 on 13 and 234 DF, p-value: < 2.2e-16
# Normal probability plot of residuals
plot(bodyF.model.trans, which=2)
# Residuals vs. fitted values
plot(bodyF.model.trans, which=c(1,3))
plot(studres(bodyF.model.trans), xlab="Fitted values", ylab="Studentized residual")
plot(rstudent(bodyF.model.trans), xlab="Fitted values", ylab="R-Student residual")
As we can see, there is no considerable improvement, and the light tails are not fixed either. Given this and the wide Box-Cox confidence interval, we consider it unnecessary to transform the response variable.
### Horizontal band, so satisfactory distribution
#par(mfrow = c(3, 3))
# Residuals vs. regressor variables
plot(bodyF$age, bodyF.model$residuals)
plot(bodyF$weight, bodyF.model$residuals)
plot(bodyF$height, bodyF.model$residuals)
plot(bodyF$neck, bodyF.model$residuals)
plot(bodyF$chest, bodyF.model$residuals)
plot(bodyF$abdomen, bodyF.model$residuals)
plot(bodyF$hip, bodyF.model$residuals)
plot(bodyF$thigh, bodyF.model$residuals)
plot(bodyF$knee, bodyF.model$residuals)
plot(bodyF$ankle, bodyF.model$residuals)
plot(bodyF$biceps, bodyF.model$residuals)
plot(bodyF$forearm, bodyF.model$residuals)
plot(bodyF$wrist, bodyF.model$residuals)
We see that all the plots have a roughly rectangular shape, indicating no correlation between the regressors and the residuals.
# Added-variable (partial regression) plots
avPlots(bodyF.model)
These plots, together with the previous ones, suggest that it may not be necessary to transform our regressors, since all the plots follow a line. We also see that the slopes for height, chest and knee are near zero, while weight, neck, abdomen, hip, forearm (driven by outliers) and wrist have larger slopes. This indicates that we should prioritize the variables with larger slopes.
# PRESS statistics
pr <- resid(bodyF.model)/(1 - lm.influence(bodyF.model)$hat)
PRESS <- sum(pr^2)
SSt <- sum((bodyF$density - mean(bodyF$density))^2)
R2prediction <- 1 - PRESS/SSt
print(R2prediction)
[1] 0.7052601
We could expect this model to explain about 70.5% of the variability when predicting new observations. This is somewhat lower than desirable, but not much of a drop compared to the R² of 0.7451.
# Define cutoff
leverage.cutoff <- 2 * p / n # MPV p. 213
cooks.cutoff <- qf(0.5, p, n - p, lower.tail = FALSE) # MPV p. 215
dfbetas.cutoff <- 2 / sqrt(n) # MPV p. 218
dffits.cutoff <- 2 * sqrt(p / n) # MPV p. 219
studres.cutoff <- qt(0.05 / 2, n - p, lower.tail = FALSE) # MPV p. 135
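The `studres.cutoff` defined above is never actually applied later in the notebook. A minimal sketch of using it to flag candidate outliers is below; a synthetic fit stands in for `bodyF.model`, since the data file is not bundled here.

```r
## Synthetic stand-in fit (illustration only)
set.seed(1)
demo <- data.frame(x = rnorm(100))
demo$y <- 2 * demo$x + rnorm(100)
demo.fit <- lm(y ~ x, data = demo)

## Same cutoff rule as above: two-sided t quantile at alpha = 0.05
n.d <- nrow(demo)
p.d <- length(coef(demo.fit))
cutoff <- qt(0.05 / 2, n.d - p.d, lower.tail = FALSE)

## Observations whose R-Student residual exceeds the cutoff in magnitude
flagged <- which(abs(rstudent(demo.fit)) > cutoff)
```

With the real data, replacing `demo.fit` with `bodyF.model` would test whether points such as 39, 203 and 220 exceed the cutoff.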
### leverage points
bodyF.hat <- hatvalues(bodyF.model)
a <-bodyF.hat[bodyF.hat > leverage.cutoff]
print(a)
5 31 36 39 41 52 83 102 155 171 202 212
0.1221526 0.3266160 0.2006317 0.4580794 0.2170594 0.1694985 0.3631388 0.1817106 0.1879068 0.2729657 0.1836120 0.1312186
View(bodyF[names(a),])
plot(bodyF.model, which=5)
We can see that none of the points are considered influential based on the Cook's distance cutoff. However, points 39 and 83 are high-leverage points and the closest to being influential. Inspecting the data, point 39 corresponds to the heaviest individual (roughly a 100-pound gap), which translates to high leverage. Moreover, as we saw in the studentized residuals vs. fitted values plot, it also has one of the largest residuals (a poor fit), which makes it more influential than the rest.
plot(bodyF.model, which=4)
This plot confirms our finding that no point is influential by the Cook's distance cutoff; however, point 39 has a much larger value than the rest.
# DFFITS
bodyF.model.extra <- data.frame(fitted.values= bodyF.model$fitted.values, dffits= dffits(bodyF.model))
bodyF.model.extra[abs(bodyF.model.extra[,"dffits"]) > dffits.cutoff,]
pp <- ggplot(bodyF.model.extra, aes(x = fitted.values, y = dffits)) +
    geom_point() +
    geom_hline(yintercept = c(-dffits.cutoff, dffits.cutoff), col = "red", linetype = "dashed")
ggplotly(pp, tooltip="text")
When we analyze the DFFITS we observe multiple points that pass our threshold, which could imply that we should adjust the standard threshold to one more suitable for our purpose. Nevertheless, the only points considerably over the threshold are 39 and 83, whose influence we already observed via Cook's distance.
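The `dfbetas.cutoff` defined earlier is likewise never applied in the notebook. A sketch of a per-coefficient DFBETAS check is below, again on a synthetic fit standing in for `bodyF.model`.

```r
## Synthetic stand-in fit (illustration only)
set.seed(1)
demo <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
demo$y <- demo$x1 - demo$x2 + rnorm(100)
demo.fit <- lm(y ~ ., data = demo)

db <- dfbetas(demo.fit)            # rows = observations, cols = coefficients
cutoff <- 2 / sqrt(nrow(demo))     # MPV p. 218, same rule as defined earlier

## (observation, coefficient) pairs exceeding the cutoff in magnitude
flagged <- which(abs(db) > cutoff, arr.ind = TRUE)
```

Unlike DFFITS, this identifies which particular coefficient an observation pulls on, which would show whether points 39 and 83 mainly influence, say, the weight coefficient.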
Now that we know the influence of points 39 and 83 on the fit, we need to examine them and decide whether they are truly outliers, or compare the models fitted without them and consider keeping them.
bodyF.model.out1 <- lm(density ~ ., data = bodyF[-39,])
bodyF.model.out2 <- lm(density ~ ., data = bodyF[-83,])
bodyF.model.out3 <- lm(density ~ ., data = bodyF[c(-39, -83),])  # note the comma: c(-39 -83) would drop only row 122
print("All points")
[1] "All points"
summary(bodyF.model)["adj.r.squared"]
$adj.r.squared
[1] 0.7309834
print("Without 39")
[1] "Without 39"
summary(bodyF.model.out1)["adj.r.squared"]
$adj.r.squared
[1] 0.7348898
print("Without 83")
[1] "Without 83"
summary(bodyF.model.out2)["adj.r.squared"]
$adj.r.squared
[1] 0.7332889
print("Without 39 and 83")
[1] "Without 39 and 83"
summary(bodyF.model.out3)["adj.r.squared"]
$adj.r.squared
[1] 0.7315509
We see little difference in the adjusted R² when we remove the influential points. Therefore, we conclude that we will keep them. In addition, their values look like plausible body proportions, which gives us added reason to keep them.
Since most of our regressors correspond to body dimensions, we already expect some correlation between them. We also expect weight to be correlated with several of the regressors.
# Correlation matrix
ggpairs(data = bodyF[,-1])
We can see that the least correlated variable is age, followed by height. However, the remaining body dimensions are highly correlated with each other, as expected.
# Variance inflation factor (VIF)
# Cutoff is 10
vif(bodyF.model)
age weight height neck chest abdomen hip thigh knee ankle biceps forearm wrist
2.256000 43.944746 2.865731 4.391047 10.165371 12.881638 14.546865 7.815291 4.744625 1.952864 3.683412 2.172323 3.354584
We see that weight, chest (barely), abdomen and hip surpass the VIF cutoff of 10, which indicates that these variables depend strongly on the rest. We will iteratively remove these variables and recompute the VIFs until no variable exceeds 10.
bodyF.noWeight <- bodyF[,-3]
bodyF.model.red1 <- lm(density ~ ., data = bodyF.noWeight)
vif(bodyF.model.red1)
age height neck chest abdomen hip thigh knee ankle biceps forearm wrist
2.186067 1.727386 3.896814 7.886385 11.616571 10.767727 7.674578 4.691659 1.869511 3.515374 2.172265 3.299420
We observe that we have considerably reduced the multicollinearity. Next we remove abdomen.
bodyF.noWeight_noAbs <- bodyF[,c(-3, -7)]
bodyF.model.red1 <- lm(density ~ ., data = bodyF.noWeight_noAbs)
vif(bodyF.model.red1)
age height neck chest hip thigh knee ankle biceps forearm wrist
1.827180 1.700645 3.867916 5.014769 8.523918 7.547401 4.685981 1.860100 3.480125 2.167945 3.269275
We still have some multicollinearity, e.g. in chest, hip and thigh, but not as severe as before. Therefore we can proceed to variable selection with this reduced data.
bodyF.model.red2 <- regsubsets(density~., data = bodyF.noWeight_noAbs, nvmax = 11)
a<-summary(bodyF.model.red2)
summary(bodyF.model.red2)
Subset selection object
Call: regsubsets.formula(density ~ ., data = bodyF.noWeight_noAbs,
nvmax = 11)
11 Variables (and intercept)
Forced in Forced out
age FALSE FALSE
height FALSE FALSE
neck FALSE FALSE
chest FALSE FALSE
hip FALSE FALSE
thigh FALSE FALSE
knee FALSE FALSE
ankle FALSE FALSE
biceps FALSE FALSE
forearm FALSE FALSE
wrist FALSE FALSE
1 subsets of each size up to 11
Selection Algorithm: exhaustive
age height neck chest hip thigh knee ankle biceps forearm wrist
1 ( 1 ) " " " " " " "*" " " " " " " " " " " " " " "
2 ( 1 ) " " "*" " " "*" " " " " " " " " " " " " " "
3 ( 1 ) "*" " " " " "*" " " "*" " " " " " " " " " "
4 ( 1 ) "*" " " " " "*" " " "*" " " " " " " " " "*"
5 ( 1 ) "*" "*" " " "*" " " "*" " " " " " " " " "*"
6 ( 1 ) "*" "*" " " "*" "*" "*" " " " " " " " " "*"
7 ( 1 ) "*" "*" "*" "*" "*" "*" " " " " " " " " "*"
8 ( 1 ) "*" "*" "*" "*" "*" "*" " " " " " " "*" "*"
9 ( 1 ) "*" "*" "*" "*" "*" "*" " " " " "*" "*" "*"
10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " " "*" "*" "*"
11 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
print("Best Adjusted R2")
[1] "Best Adjusted R2"
which.max(summary(bodyF.model.red2)$adjr2)
[1] 8
print("Best Cp")
[1] "Best Cp"
which.min(summary(bodyF.model.red2)$cp)
[1] 8
print("Best BIC")
[1] "Best BIC"
which.min(summary(bodyF.model.red2)$bic)
[1] 4
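To inspect the model picked by a given criterion, `coef()` can be applied to the `regsubsets` object with the chosen size. The sketch below uses the built-in `mtcars` data as a stand-in; with the real data, the call would be `coef(bodyF.model.red2, 4)` for the BIC-best size reported above.

```r
library(leaps)

## Best-subset search on a stand-in dataset (illustration only)
fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)

## Size minimizing BIC, then the coefficients of that subset
k.bic <- which.min(summary(fit)$bic)
coef(fit, k.bic)   # intercept plus the k.bic selected regressors
```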
all_possible_res <- ols_step_all_possible(bodyF.model.red1, metric = c("rsquare", "adjr", "cp", "aic", "sbic", "msep"))
all_possible_res
plot(all_possible_res)
It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.